In [1]:
############################################
# 
# Marcus Bischof
# Divvy EDA : Chicago
#
############################################

# Operations
import pandas as pd
import numpy as np

# Image libs
from PIL import Image, ImageChops
from folium.raster_layers import ImageOverlay

# Custom functions
from functions_for_eda import *

# Data viz
from matplotlib import pyplot as plt
import seaborn as sns

# Maps
import folium
from folium import plugins

# Jupyter display
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

# Do we need to load raw .csv, and create a single memory efficient .pkl?
CREATE_SMALL_MEMORY_SET = False
if CREATE_SMALL_MEMORY_SET:
    create_memory_efficient_pkl()

# Do we want to break up the 860+mb memory efficient .pkl into 10 slices?
CREATE_SLICES_OF_MEMORY_EFFICIENT_PKL = False
if CREATE_SLICES_OF_MEMORY_EFFICIENT_PKL:
    create_slices_of_memory_efficient_pkl()

# One tenth of the divvy data, to be used for exploration.
df = pd.read_pickle('../data/interim/df_0_1000000.pkl')
# Neighborhoods and geo.
n_hood = load_geojson_neighborhood_data()
stations = pd.read_pickle('../data/processed/stations.pkl')

Divvy Data Description on Kaggle's Site


  • trip_id ID attached to each trip taken
  • year year
  • month month
  • week week No.
  • day
  • hour
  • usertype
    • Customer is a rider who purchased a 24-Hour Pass
    • Subscriber is a rider who purchased an Annual Membership
  • gender
  • starttime day and time trip started, in CST
  • stoptime day and time trip ended, in CST
  • tripduration time of trip in minutes
  • temperature
  • events
  • from_station_id ID of station where trip originated
  • from_station_name name of station where trip originated
  • latitude_start station latitude
  • longitude_start station longitude
  • dpcapacity_start number of total docks at each station
  • to_station_id
  • to_station_name
  • latitude_end
  • longitude_end
  • dpcapacity_end number of total docks at each station

Custom Functions that I (mostly) wrote for this Analysis


  • To create a map.

    create_chicago_map()
  • To add points to a map. Note: icon must be a Font Awesome icon.

    add_points_to_map(folium_map_obj, color, icon, points)
  • To add a neighborhood overlay.

    add_neighborhood_overlay_to_map(folium_map_obj, neighborhood_name, color, n_hood_df_polylines)
  • To trim an image such that we can add it to a folium map element (I did not write this).

    trim(image_path)
In [2]:
m = create_chicago_map()
In [3]:
# Thanks to https://alysivji.github.io/getting-started-with-folium.html
stations_starts = stations[['lat', 'long']].values

# plot heatmap
m.add_child(plugins.HeatMap(stations_starts, radius=10))
m
Out[3]:

We see a significant concentration of Divvy stations in:

-  The loop
-  Northern neighborhoods on the lake like Lincoln Park 

For our analysis, let's first understand the data broadly.

We will then start with a neighborhood centric approach to analyzing the data. I believe that since neighborhoods contain residents that may share certain commonalities, we may see interesting trends among and between various neighborhoods.

In [4]:
df.head()
Out[4]:
trip_id year month week day hour usertype gender starttime stoptime tripduration temperature events from_station_id from_station_name latitude_start longitude_start dpcapacity_start to_station_id to_station_name latitude_end longitude_end dpcapacity_end from_neighborhood to_neighborhood same_station_trip same_neighborhood_trip
0 2355134 2014 6 27 0 23 Subscriber Male 2014-06-30 23:57:00 2014-07-01 00:07:00 10 68 tstorms 131 Lincoln Ave & Belmont Ave 41.939365 -87.668385 15 303 Broadway & Cornelia Ave 41.945512 -87.645980 15 Lake View Lake View False True
1 2355133 2014 6 27 0 23 Subscriber Male 2014-06-30 23:56:00 2014-07-01 00:00:00 4 68 tstorms 282 Halsted St & Maxwell St 41.864580 -87.646930 15 22 May St & Taylor St 41.869482 -87.655486 15 Little Italy, UIC Little Italy, UIC False True
2 2355130 2014 6 27 0 23 Subscriber Male 2014-06-30 23:33:00 2014-06-30 23:35:00 2 68 tstorms 327 Sheffield Ave & Webster Ave 41.921687 -87.653714 19 225 Halsted St & Dickens Ave 41.919936 -87.648830 15 Lincoln Park Lincoln Park False True
3 2355129 2014 6 27 0 23 Subscriber Female 2014-06-30 23:26:00 2014-07-01 00:24:00 58 68 tstorms 134 Peoria St & Jackson Blvd 41.877749 -87.649633 19 194 State St & Wacker Dr 41.887155 -87.627750 11 West Loop Loop False False
4 2355128 2014 6 27 0 23 Subscriber Female 2014-06-30 23:16:00 2014-06-30 23:26:00 10 68 tstorms 320 Loomis St & Lexington St 41.872187 -87.661501 15 134 Peoria St & Jackson Blvd 41.877749 -87.649633 19 Little Italy, UIC West Loop False False
In [5]:
g = sns.catplot(
    x="month", y="tripduration", hue="usertype",
    data=df, kind="violin"
)
In [6]:
df[['month', 'usertype', 'tripduration']].groupby(['month', 'usertype']).mean().dropna(how='any')
Out[6]:
                  tripduration
month usertype
1     Subscriber      9.275141
2     Subscriber      9.382594
3     Subscriber      9.945430
4     Subscriber     10.507275
5     Subscriber     11.367555
6     Subscriber     11.744595
7     Customer       18.666667
      Subscriber     11.769901
9     Subscriber     10.969968

Trip Duration Analysis Below

In this slice of the data, Customer trips appear only in July, so the Customer vs. Subscriber comparison rests on a single month.
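Before reading much into the July Customer average, it may help to look at how many trips sit behind each mean. The snippet below is a minimal sketch on a made-up toy frame (the `toy` values are assumptions, not Divvy data); it puts the count next to the mean for each month and usertype.

```python
import pandas as pd

# Toy stand-in for the trips frame (made-up values), with only the columns used here.
toy = pd.DataFrame({
    "month": [6, 6, 7, 7, 7, 9],
    "usertype": ["Subscriber", "Subscriber", "Customer",
                 "Subscriber", "Subscriber", "Subscriber"],
    "tripduration": [10.0, 12.0, 18.0, 11.0, 13.0, 9.0],
})

# Count and mean side by side, so a mean backed by very few rides is easy to spot.
summary = (
    toy.groupby(["month", "usertype"])["tripduration"]
       .agg(["count", "mean"])
)
print(summary)
```

On the real data the same call would be applied to `df`; if `usertype` is stored as a categorical, empty month/usertype combinations may show up with a count of 0.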

How much does trip duration vary across event types?

In [7]:
g = sns.catplot(
    x="events", y="tripduration",
    data=df, kind="violin"
)
plt.title('Do trip durations vary widely across weather events?')
Out[7]:
Text(0.5,1,'Do trip durations vary widely across weather events?')
In [8]:
# Select the column before averaging so non-numeric columns are never passed to mean().
g = sns.lineplot(x='temperature', y='tripduration', data=df.groupby('temperature')['tripduration'].mean().reset_index())
plt.title('Looking at our range of recorded temps, what is the average trip duration per temp?')
Out[8]:
Text(0.5,1,'Looking at our range of recorded temps, what is the average trip duration per temp?')

As suspected, average tripduration rises with temperature. My guess is that this is mostly because people enjoy taking longer bike trips in nice weather.
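To put a number on "goes up as temperature goes up", one option is a correlation coefficient. This is only a sketch on made-up values (the `toy` frame is an assumption); on the real data it would be computed on `df` or on the per-temperature averages plotted above.

```python
import pandas as pd

# Made-up temperature (F) and trip duration (minutes) pairs, for illustration only.
toy = pd.DataFrame({
    "temperature": [20, 35, 50, 65, 80],
    "tripduration": [8.0, 9.0, 10.5, 11.0, 12.5],
})

# Series.corr defaults to the Pearson correlation coefficient.
r = toy["temperature"].corr(toy["tripduration"])
print(round(r, 3))
```

A value close to 1 would support the upward trend; it says nothing about why riders behave that way.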

Let's quickly confirm that this temperature data makes sense.

In [9]:
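# Build a sortable year-month key such as '201406' from the starttime string.
df['ym'] = df.starttime.str.slice(0, 7).str.replace('-', '', regex=False)

# Select the column before averaging so non-numeric columns are never passed to mean().
g = sns.lineplot(x='ym', y='temperature', data=df.groupby('ym')['temperature'].mean().reset_index().sort_values(by='ym'))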
plt.title('Look through our date range and confirm average temps are valid.')
Out[9]:
Text(0.5,1,'Look through our date range and confirm average temps are valid.')
In [10]:
trip_durations_by_hood = df[['from_neighborhood', 'tripduration']].groupby(['from_neighborhood']).agg(['count', 'mean']).sort_values([('tripduration', 'mean')], ascending=False).reset_index()
In [11]:
trip_durations_by_hood.head()
Out[11]:
  from_neighborhood tripduration
                           count       mean
0         Edgewater         1551  18.249516
1     Museum Campus         5059  17.127100
2    Little Village          222  15.981982
3           Douglas         6749  14.659505
4        Gold Coast        13179  14.650049

Let's add the top 5 neighborhoods (by average tripduration) to the map. Anything in common here?

In [12]:
for top_hood in ['Edgewater', 'Museum Campus', 'Little Village', 'Douglas', 'Gold Coast']:
    add_neighborhood_overlay_to_map(m, top_hood, 'red', n_hood)
m
Out[12]:

I am certainly getting the impression that the top neighborhoods in terms of average trip duration are neighborhoods with a small number of stations. That would make sense if riders there have to travel farther to reach another dock.

Next, we will color every neighborhood in one of three tiers by average trip duration:

-  The top neighborhoods (1 --> 13): yellow
-  The middle neighborhoods (14 --> 26): green
-  The bottom neighborhoods (27 --> 39): blue
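One thing worth checking when building these tiers: `pd.cut` splits the range of values into equal-width intervals and assigns labels from the lowest interval upward, so it does not by itself guarantee 13 neighborhoods per color. A split by rank into equally sized groups could be sketched with `pd.qcut`. The values and the label names below are assumptions for illustration only.

```python
import pandas as pd

# Made-up average trip durations, sorted from highest to lowest.
means = pd.Series([18.2, 17.1, 16.0, 14.7, 14.6, 9.0], name="mean_duration")

# pd.cut: three equal-width value ranges, labelled from the lowest range upward.
by_value = pd.cut(means, 3, labels=["low", "mid", "high"])

# pd.qcut: three groups of (roughly) equal size, also labelled lowest to highest.
by_rank = pd.qcut(means, 3, labels=["low", "mid", "high"])

print(pd.DataFrame({"mean": means, "cut": by_value, "qcut": by_rank}))
```

With these toy values `pd.cut` puts three rows in the top tier and one in the bottom tier, while `pd.qcut` puts two rows in each tier.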
In [13]:
trip_durations_by_hood['color'] = pd.cut(np.array(trip_durations_by_hood['tripduration']['mean']), 3, labels=["yellow", "green", "blue"])
In [14]:
neighborhood_map = create_chicago_map()

for neighborhood in trip_durations_by_hood.itertuples():
    add_neighborhood_overlay_to_map(neighborhood_map, neighborhood[1], neighborhood[4], n_hood)

add_image_to_map(
    neighborhood_map, 'avg_trip_duration.png', 41.954883, -87.594551, 
    41.894883, -87.494551
)

neighborhood_map
Out[14]:
In [15]:
df[['week', 'tripduration']].groupby('week').mean().plot.bar(title="Avg. Trip Duration per Week")
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a24021da0>

Confirm below that week one corresponds to the first week of January, as expected. The chart above makes sense now, as we would expect the average duration of trips to be shorter in the winter and longer in the summer.
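Another way to sanity-check the week numbering is to recompute it from the timestamps. The sketch below uses two `starttime` values that appear in the outputs of this notebook and derives ISO 8601 week numbers with pandas. Whether the dataset's `week` and `day` columns were built this way is an assumption; the snippet only checks that the two conventions agree on these rows.

```python
import pandas as pd

# Two starttime values copied from rows displayed in this notebook.
starts = pd.to_datetime(pd.Series(["2014-01-05 14:54:00", "2014-06-30 23:57:00"]))

iso = starts.dt.isocalendar()      # columns: year, week, day (ISO 8601)
weeks = iso["week"].tolist()
weekdays = starts.dt.dayofweek.tolist()  # Monday=0 ... Sunday=6

print(weeks, weekdays)
```

For these two rows the ISO weeks come out as 1 and 27 and the weekdays as 6 and 0, which matches the `week` and `day` values shown for the same trips.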

In [16]:
df[df.week == 1].head()
Out[16]:
trip_id year month week day hour usertype gender starttime stoptime tripduration temperature events from_station_id from_station_name latitude_start longitude_start dpcapacity_start to_station_id to_station_name latitude_end longitude_end dpcapacity_end from_neighborhood to_neighborhood same_station_trip same_neighborhood_trip ym
576002 1110948 2014 1 1 6 14 Subscriber Male 2014-01-05 14:54:00 2014-01-05 15:02:00 7 12 rain or snow 197 Michigan Ave & Madison St 41.882134 -87.625125 19 191 Canal St & Monroe St 41.880700 -87.639470 23 Loop West Loop False False 201401
576003 1110947 2014 1 1 6 14 Subscriber Male 2014-01-05 14:47:00 2014-01-05 14:57:00 9 12 rain or snow 20 Sheffield Ave & Kingsbury St 41.909592 -87.653497 15 48 Larrabee St & Kingsbury St 41.897764 -87.642884 27 River North River North False True 201401
576004 1110946 2014 1 1 6 14 Subscriber Male 2014-01-05 14:38:00 2014-01-05 14:47:00 8 12 rain or snow 21 Aberdeen St & Jackson Blvd 41.877726 -87.654787 15 71 Morgan St & Lake St 41.885483 -87.652305 15 West Loop West Loop False True 201401
576005 1110936 2014 1 1 6 14 Subscriber Male 2014-01-05 14:16:00 2014-01-05 14:33:00 16 12 rain or snow 191 Canal St & Monroe St 41.880700 -87.639470 23 111 Sedgwick St & Huron St 41.894666 -87.638437 19 West Loop River North False False 201401
576006 1110935 2014 1 1 6 14 Subscriber Male 2014-01-05 14:03:00 2014-01-05 14:17:00 13 12 rain or snow 48 Larrabee St & Kingsbury St 41.897764 -87.642884 27 20 Sheffield Ave & Kingsbury St 41.909592 -87.653497 15 River North River North False True 201401

Station Capacity Analysis Below

We want to understand how capacity can potentially be analyzed and predicted.

Let's first understand how neighborhoods differ when it comes to the percentage of trips that end at a different neighborhood vs. trips that end in the same neighborhood. Likewise, let's examine the same statistic for station to station trips.
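The share of trips that leave their starting neighborhood can also be computed without an explicit loop. This is a minimal sketch on a toy frame (made-up rows, plain string columns assumed); the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Toy trips: three start in the Loop (two leave it), two start and end in Lake View.
toy = pd.DataFrame({
    "from_neighborhood": ["Loop", "Loop", "Loop", "Lake View", "Lake View"],
    "to_neighborhood":   ["Loop", "West Loop", "River North", "Lake View", "Lake View"],
})

# True where the trip ends in a different neighborhood than it started in.
leaves = toy["from_neighborhood"] != toy["to_neighborhood"]

# The mean of a boolean Series is the fraction of True values, here per origin neighborhood.
share_leaving = leaves.groupby(toy["from_neighborhood"]).mean().sort_values()
print(share_leaving)
```

If the two neighborhood columns are stored as categoricals with different category sets, the comparison may need the columns cast to strings first.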

In [17]:
n_hood_different_neighborhood_ratios = []
for n in df.from_neighborhood.unique():
    n_hood_different_neighborhood_ratios.append((n, len(df[(df.from_neighborhood == n) & (df.to_neighborhood != n)]) / len(df[df.from_neighborhood == n])))
n_hood_different_neighborhood_ratios.sort(key=lambda tup: tup[1])
In [18]:
neighborhood_diff_trip_end_density = create_chicago_map()
for n, density in n_hood_different_neighborhood_ratios:
    add_neighborhood_overlay_to_map_with_fill(neighborhood_diff_trip_end_density, n, 'yellow', n_hood, density)

add_image_to_map(
    neighborhood_diff_trip_end_density, 'same_neighborhood_density.png', 41.954883, -87.594551, 
    41.894883, -87.394551
)
In [19]:
neighborhood_diff_trip_end_density
Out[19]:
  • We do notice that smaller neighborhoods tend to have more trips that end outside of them. This makes perfect sense: riders are interested in going places, and the likelihood that a trip ends outside of a neighborhood boundary should increase the smaller the neighborhood is (or the closer the station is to a neighborhood edge, for that matter).
  • Furthermore, we notice that neighborhoods on the outer edge (i.e. those containing the edge Divvy stations) tend to have a higher density, which may indicate that trips tend to gravitate towards the center. This makes sense to me for two reasons:
    • Out of sheer necessity: if there is no Divvy station farther out from an edge, it is more likely that bikers will go in directions (central) that do have Divvy stations.
    • Assuming a significant portion of traffic involves people going to work or going to activities (bars, skating rinks, you name it), there is a higher concentration of workplaces and activities the closer you are to the transportation hubs (i.e. OTC, Union Station, CTA stops).
In [20]:
df.head()
Out[20]:
trip_id year month week day hour usertype gender starttime stoptime tripduration temperature events from_station_id from_station_name latitude_start longitude_start dpcapacity_start to_station_id to_station_name latitude_end longitude_end dpcapacity_end from_neighborhood to_neighborhood same_station_trip same_neighborhood_trip ym
0 2355134 2014 6 27 0 23 Subscriber Male 2014-06-30 23:57:00 2014-07-01 00:07:00 10 68 tstorms 131 Lincoln Ave & Belmont Ave 41.939365 -87.668385 15 303 Broadway & Cornelia Ave 41.945512 -87.645980 15 Lake View Lake View False True 201406
1 2355133 2014 6 27 0 23 Subscriber Male 2014-06-30 23:56:00 2014-07-01 00:00:00 4 68 tstorms 282 Halsted St & Maxwell St 41.864580 -87.646930 15 22 May St & Taylor St 41.869482 -87.655486 15 Little Italy, UIC Little Italy, UIC False True 201406
2 2355130 2014 6 27 0 23 Subscriber Male 2014-06-30 23:33:00 2014-06-30 23:35:00 2 68 tstorms 327 Sheffield Ave & Webster Ave 41.921687 -87.653714 19 225 Halsted St & Dickens Ave 41.919936 -87.648830 15 Lincoln Park Lincoln Park False True 201406
3 2355129 2014 6 27 0 23 Subscriber Female 2014-06-30 23:26:00 2014-07-01 00:24:00 58 68 tstorms 134 Peoria St & Jackson Blvd 41.877749 -87.649633 19 194 State St & Wacker Dr 41.887155 -87.627750 11 West Loop Loop False False 201406
4 2355128 2014 6 27 0 23 Subscriber Female 2014-06-30 23:16:00 2014-06-30 23:26:00 10 68 tstorms 320 Loomis St & Lexington St 41.872187 -87.661501 15 134 Peoria St & Jackson Blvd 41.877749 -87.649633 19 Little Italy, UIC West Loop False False 201406
In [21]:
# Let's validate that many trips take place between typical work hours as hypothesized above.
df[['week','hour']].groupby(['hour']).count().reset_index().rename(columns={"week": "count"}).plot.bar(x='hour', y='count', title='Amount of trips that start during this hour (military hours)')
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a1d518390>

The bimodal distribution above, with peaks at __8am__ and __5pm__, suggests that my hypothesis regarding work trips is at least partially correct.

In [22]:
df.head()
Out[22]:
trip_id year month week day hour usertype gender starttime stoptime tripduration temperature events from_station_id from_station_name latitude_start longitude_start dpcapacity_start to_station_id to_station_name latitude_end longitude_end dpcapacity_end from_neighborhood to_neighborhood same_station_trip same_neighborhood_trip ym
0 2355134 2014 6 27 0 23 Subscriber Male 2014-06-30 23:57:00 2014-07-01 00:07:00 10 68 tstorms 131 Lincoln Ave & Belmont Ave 41.939365 -87.668385 15 303 Broadway & Cornelia Ave 41.945512 -87.645980 15 Lake View Lake View False True 201406
1 2355133 2014 6 27 0 23 Subscriber Male 2014-06-30 23:56:00 2014-07-01 00:00:00 4 68 tstorms 282 Halsted St & Maxwell St 41.864580 -87.646930 15 22 May St & Taylor St 41.869482 -87.655486 15 Little Italy, UIC Little Italy, UIC False True 201406
2 2355130 2014 6 27 0 23 Subscriber Male 2014-06-30 23:33:00 2014-06-30 23:35:00 2 68 tstorms 327 Sheffield Ave & Webster Ave 41.921687 -87.653714 19 225 Halsted St & Dickens Ave 41.919936 -87.648830 15 Lincoln Park Lincoln Park False True 201406
3 2355129 2014 6 27 0 23 Subscriber Female 2014-06-30 23:26:00 2014-07-01 00:24:00 58 68 tstorms 134 Peoria St & Jackson Blvd 41.877749 -87.649633 19 194 State St & Wacker Dr 41.887155 -87.627750 11 West Loop Loop False False 201406
4 2355128 2014 6 27 0 23 Subscriber Female 2014-06-30 23:16:00 2014-06-30 23:26:00 10 68 tstorms 320 Loomis St & Lexington St 41.872187 -87.661501 15 134 Peoria St & Jackson Blvd 41.877749 -87.649633 19 Little Italy, UIC West Loop False False 201406
In [23]:
tmp = df[df.hour.isin([8, 17])].groupby(['from_neighborhood', 'hour']).count().sort_values(by='trip_id', ascending=True).reset_index()
tmp = tmp[['from_neighborhood', 'hour', 'trip_id']]
tmp.columns = ['from_neighborhood', 'hour', 'count']
ax = sns.barplot(x="from_neighborhood", y="count", hue="hour", data=tmp)
plt.xticks(rotation='vertical')
ax.set_title('Where do trips START: 8am vs. 5pm?')
Out[23]:
Text(0.5,1,'Where do trips START: 8am vs. 5pm?')

Notice the major differences between 8am and 5pm in trips originating in the Loop, River North, and Streeterville!

(For those unfamiliar with Chicago, these three neighborhoods are "downtown". A lot of people work in them, and they are not very residential.)
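To rank neighborhoods by how lopsided the two peaks are, the counts can be pivoted so that 8am and 5pm sit side by side. This is a sketch on made-up rows; the `pm_minus_am` column name is hypothetical.

```python
import pandas as pd

# Made-up trip starts: the Loop is busier at 17:00, Lake View at 08:00.
toy = pd.DataFrame({
    "from_neighborhood": ["Loop", "Loop", "Loop", "Lake View", "Lake View", "Lake View"],
    "hour": [17, 17, 8, 8, 8, 17],
})

# One row per neighborhood, one column per hour, cell = number of trip starts.
counts = (
    toy[toy["hour"].isin([8, 17])]
    .groupby(["from_neighborhood", "hour"]).size()
    .unstack("hour", fill_value=0)
)

# Positive: more departures in the evening peak; negative: more in the morning peak.
counts["pm_minus_am"] = counts[17] - counts[8]
print(counts.sort_values("pm_minus_am"))
```

Sorting by the difference would put residential-looking neighborhoods (morning departures) at one end and work-heavy ones (evening departures) at the other.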

Let's examine capacity as it relates to various stations.

In [24]:
df.head()
Out[24]:
trip_id year month week day hour usertype gender starttime stoptime tripduration temperature events from_station_id from_station_name latitude_start longitude_start dpcapacity_start to_station_id to_station_name latitude_end longitude_end dpcapacity_end from_neighborhood to_neighborhood same_station_trip same_neighborhood_trip ym
0 2355134 2014 6 27 0 23 Subscriber Male 2014-06-30 23:57:00 2014-07-01 00:07:00 10 68 tstorms 131 Lincoln Ave & Belmont Ave 41.939365 -87.668385 15 303 Broadway & Cornelia Ave 41.945512 -87.645980 15 Lake View Lake View False True 201406
1 2355133 2014 6 27 0 23 Subscriber Male 2014-06-30 23:56:00 2014-07-01 00:00:00 4 68 tstorms 282 Halsted St & Maxwell St 41.864580 -87.646930 15 22 May St & Taylor St 41.869482 -87.655486 15 Little Italy, UIC Little Italy, UIC False True 201406
2 2355130 2014 6 27 0 23 Subscriber Male 2014-06-30 23:33:00 2014-06-30 23:35:00 2 68 tstorms 327 Sheffield Ave & Webster Ave 41.921687 -87.653714 19 225 Halsted St & Dickens Ave 41.919936 -87.648830 15 Lincoln Park Lincoln Park False True 201406
3 2355129 2014 6 27 0 23 Subscriber Female 2014-06-30 23:26:00 2014-07-01 00:24:00 58 68 tstorms 134 Peoria St & Jackson Blvd 41.877749 -87.649633 19 194 State St & Wacker Dr 41.887155 -87.627750 11 West Loop Loop False False 201406
4 2355128 2014 6 27 0 23 Subscriber Female 2014-06-30 23:16:00 2014-06-30 23:26:00 10 68 tstorms 320 Loomis St & Lexington St 41.872187 -87.661501 15 134 Peoria St & Jackson Blvd 41.877749 -87.649633 19 Little Italy, UIC West Loop False False 201406
In [25]:
for station in df[df.from_neighborhood == 'Wicker Park'].from_station_name.unique():
    # For each Wicker Park station (including one near and dear to my heart, Ashland Ave & Division St), count the trips from that station per month, split by whether the trip ended in the same neighborhood.
    trips_from = df[df.from_station_name == station][
        ['dpcapacity_start', 'ym', 'same_neighborhood_trip']
    ].groupby(['ym', 'same_neighborhood_trip']).count().reset_index().sort_values(by='ym', ascending=True)
    trips_from.columns = ['year_month', 'same_hood', 'count']
    ax = sns.lineplot(x="year_month", y="count", hue="same_hood", data=trips_from)
    ax.set_title(station)
    plt.show()
In [26]:
for n in df.from_neighborhood.unique():
    # For each neighborhood, count the trips starting there per month, split by whether the trip ended in the same neighborhood.
    trips_from = df[df.from_neighborhood == n][
        ['dpcapacity_start', 'ym', 'same_neighborhood_trip']
    ].groupby(['ym', 'same_neighborhood_trip']).count().reset_index().sort_values(by='ym', ascending=True)
    trips_from.columns = ['year_month', 'same_hood', 'count']
    ax = sns.lineplot(x="year_month", y="count", hue="same_hood", data=trips_from)
    ax.set_title('{}: Trips (count) ending in same neighborhoods vs. different neighborhoods'.format(n))
    plt.show()
  • What stands out to me the most above:
    • There is a MASSIVE difference among the various neighborhoods in terms of sheer trip volume.
      • E.g. the Loop and Lincoln Park are totally different beasts than Kenwood and Wicker Park.
    • Some neighborhoods have close to parity between same-neighborhood trips and different-neighborhood trips.
    • In other neighborhoods, different-neighborhood trips dominate same-neighborhood trips.
In [27]:
g = sns.catplot(
    x="same_neighborhood_trip", y="tripduration", col="gender",
    data=df, kind="box"
)
plt.title('Examine trip duration as it relates to gender: does this look different for same vs. diff neighborhood trips?\n\n')
Out[27]:
Text(0.5,1,'Examine trip duration as it relates to gender: does this look different for same vs. diff neighborhood trips?\n\n')

Female riders are representing and biking longer! (The 75th percentile, i.e. the top of the box, is higher.)

How different are the dock capacities of the stations where trips start and end?

In [28]:
df['dpcapacity_start'] = df['dpcapacity_start'].astype('float')
df['dpcapacity_end'] = df['dpcapacity_end'].astype('float')
df['capacity_diff'] = df['dpcapacity_start'] - df['dpcapacity_end']
In [29]:
# Juicy information ...
inflow_outflow_df = df[['from_neighborhood', 'capacity_diff']].groupby(['from_neighborhood']).mean().reset_index()
outflow_max = inflow_outflow_df['capacity_diff'].max()
inflow_min = inflow_outflow_df['capacity_diff'].min()
In [30]:
inflow_outflow_map = create_chicago_map()
for _, n, density in inflow_outflow_df.itertuples():
    if density < 0:
        c, d = 'red', abs(density) / abs(inflow_min)
    else:
        c, d = 'green', density / outflow_max
    add_neighborhood_overlay_to_map_with_fill(inflow_outflow_map, n, c, n_hood, d)

add_image_to_map(
    inflow_outflow_map, 'inflow_outflow_rack_capacity.png', 41.954883, -87.594551, 
    41.894883, -87.394551
)
In [31]:
inflow_outflow_map
Out[31]:

Above is my favorite map.

Context

1) Red indicates that, on average, trips originating in a given neighborhood went from a station with dpcapacity LESS THAN that of the station where the trip ended. Likewise, green indicates that, on average, trips originating in a given neighborhood went from a station with dpcapacity MORE THAN that of the station where the trip ended.

2) Essentially, this helps us understand the flow of traffic. What you are seeing is red neighborhoods MOSTLY flowing into green neighborhoods, and vice versa. These flows could aid in any effort to add new capacity OR rotate existing capacity. It makes sense that the residential areas tend to be red, while the working areas tend to be green.
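The coloring rule behind the map can be traced on a tiny example. The sketch below reuses the column names from this notebook on made-up rows. The helper `color_and_opacity` is hypothetical, and its guard against a non-positive maximum is an addition that is not in the cells above.

```python
import pandas as pd

# Made-up trips: A sends riders to larger stations, B to smaller ones, C slightly larger.
toy = pd.DataFrame({
    "from_neighborhood": ["A", "A", "B", "B", "C"],
    "dpcapacity_start": [15.0, 19.0, 27.0, 23.0, 15.0],
    "dpcapacity_end":   [23.0, 27.0, 15.0, 19.0, 19.0],
})
toy["capacity_diff"] = toy["dpcapacity_start"] - toy["dpcapacity_end"]

# Average start-minus-end capacity per origin neighborhood.
mean_diff = toy.groupby("from_neighborhood")["capacity_diff"].mean()


def color_and_opacity(value, most_negative, most_positive):
    """Hypothetical helper: red for negative means, green otherwise, scaled to 0..1."""
    if value < 0:
        return "red", abs(value) / abs(most_negative)
    if most_positive > 0:
        return "green", value / most_positive
    return "green", 0.0  # guard: no positive mean to scale against


styles = {
    name: color_and_opacity(value, mean_diff.min(), mean_diff.max())
    for name, value in mean_diff.items()
}
print(styles)
```

Here A averages -8 docks (full red), B averages +8 (full green), and C averages -4 (red at half opacity), which is the same normalization against the most extreme neighborhood on each side that the map uses.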